Faster Depth-Adaptive Transformers
Depth-adaptive neural networks can dynamically adjust their depth according to the
hardness of input words, and thus improve efficiency. The main challenge is how
to measure this hardness and decide the required depth (i.e., the number of layers) to
execute. Previous work generally builds a halting unit to decide whether the
computation should continue or stop at each layer. Because there is no explicit
supervision of depth selection, the halting unit may be under-optimized and
inaccurate, which results in suboptimal and unstable performance when modeling
sentences. In this paper, we get rid of the halting unit and estimate the
required depths in advance, which yields a faster depth-adaptive model.
Specifically, two approaches are proposed to explicitly measure the hardness of
input words and estimate the corresponding adaptive depths, namely 1) mutual
information (MI)-based estimation and 2) reconstruction-loss-based estimation.
We conduct experiments on the text classification task with 24 datasets of
various sizes and domains. Results confirm that our approaches can speed up the
vanilla Transformer (up to 7x) while preserving high accuracy. Moreover,
efficiency and robustness are significantly improved when compared with other
depth-adaptive approaches.
Comment: AAAI-2021. Code will appear at: https://github.com/Adaxry/Adaptive-Transforme
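The MI-based estimation above can be sketched as follows. This is a simplified illustration, not the paper's implementation: word-label mutual information is estimated from corpus counts (keeping only the word-present terms, which suffices to rank words), and a hypothetical `assign_depth` helper maps a normalized hardness score to a layer budget.

```python
import math
from collections import Counter

def word_label_mi(corpus):
    """Simplified word-label mutual information from corpus counts.

    corpus: list of (tokens, label) pairs. Only the word-present terms of the
    MI sum are kept, which is enough to rank words by label informativeness.
    """
    n_docs = len(corpus)
    word_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    for tokens, label in corpus:
        label_counts[label] += 1
        for w in set(tokens):
            word_counts[w] += 1
            joint_counts[(w, label)] += 1
    mi = {}
    for w in word_counts:
        total = 0.0
        for y in label_counts:
            p_wy = joint_counts[(w, y)] / n_docs
            if p_wy == 0.0:
                continue
            p_w = word_counts[w] / n_docs
            p_y = label_counts[y] / n_docs
            total += p_wy * math.log(p_wy / (p_w * p_y))
        mi[w] = total
    return mi

def assign_depth(score, max_score, max_depth=6):
    """Map a normalized hardness score to a layer budget in [1, max_depth].

    The linear bucketing (and its direction) is an illustrative assumption.
    """
    if max_score <= 0:
        return 1
    frac = min(max(score / max_score, 0.0), 1.0)
    return 1 + int(round(frac * (max_depth - 1)))
```

Because such statistics are computed in advance, the per-word depth is known before decoding begins, which is what removes the need for a halting unit at inference time.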
BJTU-WeChat's Systems for the WMT22 Chat Translation Task
This paper introduces the joint submission of the Beijing Jiaotong University
and WeChat AI to the WMT'22 chat translation task for English-German. Based on
the Transformer, we apply several effective variants. In our experiments, we
utilize the pre-training-then-fine-tuning paradigm. In the first pre-training
stage, we employ data filtering and synthetic data generation (i.e.,
back-translation, forward-translation, and knowledge distillation). In the
second fine-tuning stage, we investigate speaker-aware in-domain data
generation, speaker adaptation, prompt-based context modeling, target denoising
fine-tuning, and a boosted self-COMET-based model ensemble. Our systems achieve
COMET scores of 0.810 and 0.946; the English-German and German-English scores
are the highest among all submissions.
Comment: Accepted by WMT 2022 as a system paper
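The abstract does not spell out the "boosted self-COMET-based" ensembling mechanism, but the general idea of self-scored candidate selection can be sketched as follows, with a crude token-overlap F1 standing in for a COMET-style metric (everything here is an illustrative assumption):

```python
def token_f1(hyp, ref):
    """Crude token-overlap F1, standing in for a COMET-style quality score."""
    h, r = set(hyp.split()), set(ref.split())
    common = len(h & r)
    if common == 0:
        return 0.0
    p, rec = common / len(h), common / len(r)
    return 2 * p * rec / (p + rec)

def select_by_self_scoring(candidates, score):
    """Pick the candidate with the highest average score against the other
    candidates used as pseudo-references (a self-scoring consensus heuristic)."""
    best, best_avg = None, float("-inf")
    for i, hyp in enumerate(candidates):
        refs = [c for j, c in enumerate(candidates) if j != i]
        avg = sum(score(hyp, r) for r in refs) / len(refs)
        if avg > best_avg:
            best, best_avg = hyp, avg
    return best
```

The consensus-style selection favors outputs that agree with the rest of the ensemble, which is the usual motivation for metric-based ensembling in shared-task systems.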
Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment
Word alignment, which aims to extract lexical translation equivalents between
source and target sentences, serves as a fundamental tool for natural language
processing. Recent studies in this area have yielded substantial improvements
by generating alignments from contextualized embeddings of the pre-trained
multilingual language models. However, we find that existing approaches
capture few interactions between the input sentence pairs, which severely
degrades word alignment quality, especially for words that are ambiguous in
the monolingual context. To remedy this problem, we propose Cross-Align to model
deep interactions between the input sentence pairs, in which the source and
target sentences are encoded separately with the shared self-attention modules
in the shallow layers, while cross-lingual interactions are explicitly
constructed by the cross-attention modules in the upper layers. In addition, to
train our model effectively, we propose a two-stage training framework, where
the model is trained with a simple Translation Language Modeling (TLM)
objective in the first stage and then finetuned with a self-supervised
alignment objective in the second stage. Experiments show that the proposed
Cross-Align achieves the state-of-the-art (SOTA) performance on four out of
five language pairs.
Comment: Accepted by EMNLP 2022
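The encoding scheme described above can be sketched in a few lines of numpy, assuming single-head scaled dot-product attention with identity projections (the real model uses full Transformer layers with learned weights): shared self-attention in the shallow layers, explicit cross-attention in the upper layers, then source-to-target alignment by similarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no masking, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def cross_align(src, tgt, n_self=2, n_cross=2):
    """Toy Cross-Align forward pass over two token-embedding matrices."""
    for _ in range(n_self):          # shallow layers: shared self-attention
        src = src + attention(src, src, src)
        tgt = tgt + attention(tgt, tgt, tgt)
    for _ in range(n_cross):         # upper layers: explicit cross-attention
        src, tgt = src + attention(src, tgt, tgt), tgt + attention(tgt, src, src)
    sim = softmax(src @ tgt.T, axis=-1)  # source-to-target alignment probabilities
    return sim.argmax(axis=-1)           # one aligned target index per source token
```

The point of the layout is visible even in this toy: the shared shallow stack contextualizes each sentence independently, while the cross-attention stack is where information flows between the pair, which is exactly the interaction the paper argues embedding-only approaches miss.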
Improving Translation Faithfulness of Large Language Models via Augmenting Instructions
Large Language Models (LLMs) present strong general capabilities, and a
current compelling challenge is stimulating their specialized capabilities,
such as machine translation, through low-cost instruction tuning. The standard
instruction-following data is sequentially organized as the concatenation of an
instruction, an input, and a response. Because the attention mechanism of LLMs
favors nearby context, the model tends to attend more to the words or sentences
close to each position, which creates a high risk of the instruction being
forgotten during decoding. To alleviate these issues, we propose SWIE
(Segment-Weighted Instruction Embedding) and an instruction-following dataset
OVERMISS. SWIE improves the model's instruction understanding by adding a global
instruction representation to the following input and response representations.
OVERMISS improves model faithfulness by comparing over-translation and
miss-translation results with the correct translation. We apply our methods to
two mainstream open-source LLMs, BLOOM and LLaMA. The experimental results
demonstrate significant improvements in translation performance with SWIE based
on BLOOMZ-3b, particularly in zero-shot and long text translations due to
reduced instruction forgetting risk. Additionally, OVERMISS outperforms the
baseline in translation performance (e.g., BLEU score improvements ranging from
0.69 to 3.12 and an average COMET score improvement of 0.48 points for
LLaMA-7b), with further gains in models combining OVERMISS and SWIE
(e.g., BLEU score increases of up to 0.56 from English to German across three
different backbones), and both exhibit improvements in a word-alignment-based
faithfulness metric.
Comment: Our code and datasets are released on GitHub:
https://github.com/pppa2019/swie_overmiss_llm4m
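The core idea of adding a global instruction representation to the later positions can be sketched as follows; the mean pooling and the scalar weight are simplifying assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def apply_swie(instr_emb, following_emb, weight=0.5):
    """Sketch of a segment-weighted instruction embedding: pool the instruction
    tokens into one global vector and add it (scaled) to every later position,
    so the instruction stays visible regardless of distance in the sequence.

    instr_emb:     (n_instr_tokens, d) instruction token embeddings
    following_emb: (n_following_tokens, d) input/response token embeddings
    """
    g = instr_emb.mean(axis=0)           # global instruction representation
    return following_emb + weight * g    # broadcast over input/response tokens
```

Because the global vector is injected at every position, the effective distance between the instruction and any decoding step is zero, which is how the method counters the locality bias described above.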
DTV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
The many-to-many multimodal summarization (MS) task aims to generate a
summary in any language from a document in any language and the
corresponding image sequence; it essentially comprises the multimodal
monolingual summarization (MMS) and multimodal cross-lingual summarization
(MXLS) tasks. Although much work has been devoted to either MMS or MXLS, and
these tasks have attracted increasing attention in recent years, little
research addresses the MS task. Moreover, existing studies mainly focus on 1) utilizing MMS
to enhance MXLS via knowledge distillation without considering the performance
of MMS or 2) improving MMS models by filtering summary-unrelated visual
features with implicit learning or explicitly complex training objectives. In
this paper, we first introduce a general and practical task, i.e., MS.
Further, we propose a dual knowledge distillation and target-oriented vision
modeling framework for the MS task. Specifically, the dual knowledge
distillation method ensures that the knowledge of MMS and MXLS is
transferred to each other, so that the two tasks mutually promote one another. To offer
target-oriented visual features, a simple yet effective target-oriented
contrastive objective is designed to discard needless visual information.
Extensive experiments in the many-to-many setting show the
effectiveness of the proposed approach. Additionally, we will contribute a
many-to-many multimodal summarization (MSum) dataset.
Comment: EMNLP 2023 Findings
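A bidirectional distillation term of the kind described above can be sketched as a symmetric KL divergence between the two tasks' output distributions; in real training this would be applied to decoder distributions with the teacher side detached and task-specific weighting, details this sketch omits:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (with smoothing)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def dual_distill_loss(p_mms, p_mxls):
    """Dual (symmetric) distillation: each task's distribution teaches the
    other, so knowledge flows in both directions rather than only MMS->MXLS."""
    return 0.5 * (kl(p_mms, p_mxls) + kl(p_mxls, p_mms))
```

The symmetry is the point: a one-way teacher-student setup improves only the student, while the dual form lets MMS and MXLS regularize each other.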
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Recently, the emergence of ChatGPT has attracted wide attention from the
computational linguistics community. Many prior studies have shown that ChatGPT
achieves remarkable performance on various NLP tasks in terms of automatic
evaluation metrics. However, the ability of ChatGPT to serve as an evaluation
metric is still underexplored. Since assessing the quality of natural
language generation (NLG) models is an arduous task and NLG metrics notoriously
correlate poorly with human judgments, we ask whether ChatGPT is
a good NLG evaluation metric. In this report, we provide a preliminary
meta-evaluation on ChatGPT to show its reliability as an NLG metric. In detail,
we regard ChatGPT as a human evaluator and give task-specific (e.g.,
summarization) and aspect-specific (e.g., relevance) instructions to prompt
ChatGPT to evaluate the generated results of NLG models. We conduct experiments
on five NLG meta-evaluation datasets (including summarization, story generation
and data-to-text tasks). Experimental results show that compared with previous
automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation
with human judgments in most cases. In addition, we find that the effectiveness
of the ChatGPT evaluator might be influenced by the creation method of the
meta-evaluation datasets. On meta-evaluation datasets whose creation depends
heavily on the references, and which are therefore biased, the ChatGPT
evaluator may lose its effectiveness. We hope our preliminary study can prompt
the emergence of a general-purpose, reliable NLG metric.
Comment: Both first authors contributed equally. Technical report, 11 pages.
Accepted to the 4th New Frontiers in Summarization Workshop (NewSumm@EMNLP
2023)
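The evaluation protocol described above can be sketched as a prompt builder plus a rank-correlation check against human judgments; the prompt wording is illustrative, not the authors' template, and this Spearman implementation ignores ties:

```python
def build_eval_prompt(task, aspect, source, output):
    """Task- and aspect-specific instruction in the spirit of the paper
    (illustrative wording; the authors' exact template may differ)."""
    return (
        f"Score the following {task} output for {aspect} on a scale of 1-5.\n"
        f"Source:\n{source}\n\nOutput:\n{output}\n\nScore:"
    )

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling), as used in meta-evaluation
    to compare metric scores against human judgments."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A meta-evaluation then amounts to collecting the model's scores for many outputs, collecting the corresponding human ratings, and reporting the correlation between the two lists.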
Self-Supervised Intensity-Event Stereo Matching
Event cameras are novel bio-inspired vision sensors that output pixel-level
intensity changes with microsecond accuracy, a high dynamic range, and low
power consumption. Despite these advantages, event cameras cannot be directly
applied to computational imaging tasks due to the inability to obtain
high-quality intensity and events simultaneously. This paper aims to connect a
standalone event camera and a modern intensity camera so that the applications
can take advantage of both sensors. We establish this connection through a
multi-modal stereo matching task. We first convert events to a reconstructed
image and extend the existing stereo networks to this multi-modality condition.
We propose a self-supervised method to train the multi-modal stereo network
without using ground truth disparity data. The structure loss calculated on
image gradients is used to enable self-supervised learning on such multi-modal
data. Exploiting the internal stereo constraint between views with different
modalities, we introduce general stereo loss functions, including disparity
cross-consistency loss and internal disparity loss, leading to improved
performance and robustness compared to existing approaches. The experiments
demonstrate the effectiveness of the proposed method, especially the proposed
general stereo loss functions, on both synthetic and real datasets. Finally, we
shed light on employing the aligned events and intensity images in downstream
tasks, e.g., video interpolation.
Comment: This paper has been accepted by the Journal of Imaging Science &
Technology
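The disparity cross-consistency idea can be sketched in one dimension: the left disparity at position x should match the right disparity at the matched position x - d. This is a discrete, non-differentiable illustration; real implementations use differentiable warping over full disparity maps.

```python
import numpy as np

def cross_consistency_loss(disp_left, disp_right):
    """1-D sketch of a disparity cross-consistency loss along one scanline.

    For each left-image position x with disparity d, compare d with the right
    image's disparity at the matched position x - d; out-of-range matches
    (e.g., occlusions at the image border) are simply ignored here.
    """
    losses = []
    for x, d in enumerate(disp_left):
        xr = int(round(x - d))
        if 0 <= xr < len(disp_right):
            losses.append(abs(d - disp_right[xr]))
    return float(np.mean(losses)) if losses else 0.0
```

Because the constraint relates the two views' predictions to each other rather than to ground truth, it can supervise the network without any disparity labels, which is what makes the overall training self-supervised.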